I explore the properties of Red Wine. The key goal in this study is to determine which chemical properties influence the quality of red wines.
I will use R to begin exploring the data and trying to discover interesting patterns and relationships.
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
##
## The following object is masked from 'package:scales':
##
## percent
##
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
##
## The following objects are masked from 'package:base':
##
## as.array, trimws
library(gridExtra)
library(grid)
# Load wines csv
wines <- read.csv('wineQualityReds.csv')
summary(wines)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
From this box plot mapping quality of the wine, we can see most (between 25th - 75th percentile) fall within the range of 4 and 7, with the majority getting a average rating of 5 and 6.
This histogram confirms the visualization from the box plot.
Fixed acidity is normally distributed. - Mean of 8.32 - median of 7.90 - Minimum of 4.60 - Maximum of 15.90
Volatile acidity is also normally distributed. - Mean of 0.5278 - Median of 0.5200 - Minimum of 0.1200 - Maximum of 1.5800
Risidual sugars tend to be very low in red wine. - Mean of 2.539 - Median of 2.200 - Minimum of 0.900 - Maximum of 15.500
On the other hand, citric acid tend to be distributed more to the tail. - Mean of 0.271 - Median of 0.260 - Minimum of 0.000 - Max of 1.000
When controlling for outliers the amount of chloride in red wine has a normal distribution.
When controlling for the outliers at the tail and head of the distribution of free sulfur dioxide, the majority of wines have a low amount of it.
pH is normally distributed. - Mean of 3.311 - Median of 3.310 - Minimum of 2.740 - Maximum of 4.010
Sulphates is also normally distributed, when controlling for the long tail. - Mean of 0.6581 - Median of 0.6200 - Minimum of 0.3300 - Maximum of 2.000
The majority of red wines tend to have lower alcohol levels. - Mean of 10.42 - Median of 10.20 - Minimum of 8.40 - Maximum of 14.90
Finally, the majority of red wines were given an average quality rating. - Mean of 5.636 - Median of 6.000 - Minimum of 3.000 - Maximum of 8.000
To confirm this initial assessment, I added a new categorical variable.
wines$rank <- ifelse(wines$quality <= 4, 'bad', ifelse(
wines$quality < 7, 'average', 'good'))
wines$rank <- ordered(wines$rank,
levels = c('bad', 'average', 'good'))
summary(wines$rank)
## bad average good
## 63 1319 217
The tidy data set used contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Looking at the quick summary of variables, we can see that we have 11 variables which describe the properties of Red Wine.
fixed acidity: most acids involved with wine or fixed or nonvolatile (do not evaporate readily)
volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of water is close to that of water depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates: a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): - quality (score between 0 and 10)
The main features of interest is quality. The key objective of this analysis is to determine which chemical properties influence the quality of red wines.
I believe the acidity, pH and sulfur are most likely to contribute to the quality of the red wine.
Citric acid was the only type of acid that was not normally distributed. After some investigation, I found that a significant number of data rows had a value of 0 for citric acid. However, doing some quick research, this was likely intentional and not an error as citric acid is used less frequently in wine compared to tartaric and malic due to the aggressive citric flavors it can add to the wine.
wine_subset <- wines[,2:13]
colnames(wine_subset) = c("F.A", "V.A", "Cit.A", "Sug", "CI",
"F.S02", "S02", "Dens", "pH", "S04","Alc", "Q")
ggpairs(wine_subset, params=list(corSize=3)) + theme(axis.text = element_blank())
I first use a matrix to quickly find and display some potential relationships between variables.
Using scatter plots, I can quickly see the patterns between the two variables and whether the correlation figures seem correct.
From these plots, I observed: - There is a strong positive correlation between alcohol and quality (0.476) - There is strong postive correlation between density and fixed acidity (0.669) - There is a strong negative correlation between pH and fixed acidity (-0.683) - Strong negative correlation between pH and citric acid (-0.542) - Strong negative correlation between alcohol and density (-0.496)
Not knowing much about wine, I found many interesting relationships. The first of which are that citric acid, pH and alcohol all have strong relationships with the quality of the wine.
In addition, a wine that has higher density generally has a higher fixed acidity. A wine with a high alcohol level also tends to have higher chlorides. Finally, a wine that has higher pH will have a lower level of citric acid and fixed acidity.
The strongest relationships found were between alcohol and chlorides as well as alcohol and quality.
Looking at the same plots as in the bivariate section, except this time I use rank instead of quality.
Taking a closer look at the relationship between alcohol and pH, using rank we are able to better see the distinction between the three categories of wine. Despite my hypothesis that more basic wines would be rated less highly, this does not appear to be the case.
However, there definitely is a strong correlation between alcohol % and quality. I chose to use a box plot in order to display the quartile information and show the distribution of the wines by quality effectively.
This plot was created to show the majority of wines were given an average rating and therefore the dataset is normally distributed.
This plot was created to show the relationship between pH, alcohol and quality and how quality red wines tend to have a higher alcohol level coupled with lower pH.
These box plots emphasize the strong positive correlation between alcohol and wine quality.
Through this assignment I was able to learn more about wines and the factors attributed to their quality. However, as the majority of wines were given an average rating this dataset is not necessarily the end all of analysis in this area. This is likely due to the fact that the quality of wines are often subjective for each individual.
Through this analysis I think the largest struggle for me was deciding the appropriate plots to use in the analysis. Ultimately I decided that distribution analysis seemed to be the best way to approach the ranking of quality, so I relied mostly on histograms, scatter plots and box plots.
I think the greatest success was having a set of graphs by the time I arrived at the final section with which I can use to come up with a reasonable story and conclusion. Coming from a business background, it was exciting approaching a problem from an analytical perspective rather than merely empircally or theoretically.
In the future, I would like to do a more detailed breakdown of the different factors that affect the quality of the red wine. Perhaps data on the demographics and psychographics of the wine experts would be interesting to look at as well. I feel that my results merely confirmed suspicions I had before combing through the dataset.